109 research outputs found

    On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism

    Full text link
    Barrón Cedeño, LA. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism [Unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/16012

    The (Undesired) Attenuation of Human Biases by Multilinguality

    Get PDF
    Some human preferences are universal. The odor of vanilla is perceived as pleasant all around the world. We expect neural models trained on human texts to exhibit these kinds of preferences, i.e. biases, but we show that this is not always the case. We explore 16 static and contextual embedding models in 9 languages and, when possible, compare them under similar training conditions. We introduce and release CA-WEAT, multilingual culturally aware tests to quantify biases, and compare them to previous English-centric tests. Our experiments confirm that monolingual static embeddings do exhibit human biases, but the values differ across languages and are far from universal. Biases are less evident in contextual models, to the point that the original human association may be reversed. Multilinguality proves to be another variable that attenuates and even reverses the effect of the bias, especially in contextual multilingual models. To explain this variance among models and languages, we examine the effect of asymmetries in the training corpus, departures from isomorphism in multilingual embedding spaces, and discrepancies in the testing measures between languages.
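    The CA-WEAT tests build on the WEAT family of association tests, whose core statistic is an effect size computed from cosine similarities between target and attribute word vectors. Below is a minimal sketch of that statistic, assuming a `vectors` lookup from words to embeddings; the word lists in the usage comment are the classic English flower/insect example, not the released CA-WEAT lists.

    ```python
    # Minimal WEAT-style effect size, the statistic that tests such as CA-WEAT
    # build on. `vectors` (word -> embedding) and the example word lists are
    # placeholders; the actual CA-WEAT lists are released with the paper.
    import numpy as np

    def cos(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def association(w, A, B, vectors):
        # s(w, A, B): mean similarity of w to attribute set A minus to set B
        return (np.mean([cos(vectors[w], vectors[a]) for a in A])
                - np.mean([cos(vectors[w], vectors[b]) for b in B]))

    def weat_effect_size(X, Y, A, B, vectors):
        # Cohen's-d-style effect size over the two target sets X and Y
        s_X = [association(x, A, B, vectors) for x in X]
        s_Y = [association(y, A, B, vectors) for y in Y]
        return (np.mean(s_X) - np.mean(s_Y)) / np.std(s_X + s_Y, ddof=1)

    # Illustrative usage: flowers vs. insects against pleasant vs. unpleasant
    # X = ["rose", "tulip", ...]; Y = ["ant", "wasp", ...]
    # A = ["love", "peace", ...]; B = ["hatred", "pain", ...]
    # d = weat_effect_size(X, Y, A, B, vectors)  # d > 0 mirrors the human bias
    ```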

    An Empirical Analysis of NMT-Derived Interlingual Embeddings and their Use in Parallel Sentence Identification

    Get PDF
    End-to-end neural machine translation has overtaken statistical machine translation in terms of translation quality for some language pairs, especially those with large amounts of parallel data. Besides this palpable improvement, neural networks provide several new properties. A single system can be trained to translate between many languages at almost no additional cost other than training time. Furthermore, the internal representations learned by the network serve as a new semantic representation of words, or sentences, which, unlike standard word embeddings, are learned in an essentially bilingual or even multilingual context. In view of these properties, the contribution of the present work is two-fold. First, we systematically study the NMT context vectors, i.e. the output of the encoder, and their power as an interlingua representation of a sentence. We assess their quality and effectiveness by measuring similarities across translations, as well as between semantically related and semantically unrelated sentence pairs. Second, as an extrinsic evaluation of the first point, we identify parallel sentences in comparable corpora, obtaining an F1 of 98.2% on data from a shared task when using only NMT context vectors. Using context vectors jointly with similarity measures, F1 reaches 98.9%.
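    As an illustration of the extrinsic evaluation described above, here is a sketch of parallel sentence identification from encoder context vectors. It assumes per-token encoder states are already available; mean pooling and the similarity threshold are assumptions, not the paper's exact configuration.

    ```python
    # Score candidate sentence pairs by cosine similarity of their (mean-pooled)
    # NMT encoder context vectors and keep the pairs above a threshold.
    import numpy as np

    def sentence_vector(token_states: np.ndarray) -> np.ndarray:
        # token_states: (num_tokens, hidden_dim) matrix of encoder outputs
        return token_states.mean(axis=0)

    def cosine(u, v):
        return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

    def find_parallel(src_vecs, tgt_vecs, threshold=0.5):
        """Return (i, j, score) for source/target sentences judged parallel."""
        pairs = []
        for i, u in enumerate(src_vecs):
            for j, v in enumerate(tgt_vecs):
                score = cosine(u, v)
                if score >= threshold:
                    pairs.append((i, j, score))
        return sorted(pairs, key=lambda p: -p[2])
    ```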

    1949-2019: 70 anni di TA visti attraverso i dati utilizzati

    Get PDF
    Machine translation (MT) has undergone several transformations from the 1940s to today. As in many other areas of computer science and artificial intelligence, the field has moved from manually developed, ad hoc resources to approaches that rely increasingly on pre-existing data. This contribution offers an overview of the different MT architectures and of the data they require, starting from rule-based approaches and moving on to statistical, example-based and neural architectures. Each of these shifts has affected the kind of data needed to build MT engines. Whereas the earliest approaches did not require aligned sentences, with statistical MT it became essential to rely on large amounts of parallel data. Today, thanks to neural networks, good-quality translation can be obtained even for language combinations for which data are not available in both languages.

    UniBO @ AMI: A Multi-Class Approach to Misogyny and Aggressiveness Identification on Twitter Posts Using AlBERTo

    Get PDF
    We describe our participation in the EVALITA 2020 (Basile et al., 2020) shared task on Automatic Misogyny Identification. We focus on Task A, Misogyny and Aggressive Behaviour Identification, which aims at detecting whether a tweet in Italian is misogynous and, if so, whether it is aggressive. Rather than building two different models, one for misogyny and one for aggressiveness identification, we handle the problem as a single multi-class classification task with three classes: non-misogynous, non-aggressive misogynous, and aggressive misogynous. Our three-class supervised model, built on top of AlBERTo, obtains an overall F1 score of 0.7438 on the task test set (F1 = 0.8102 for the misogyny subtask and F1 = 0.6774 for the aggressiveness subtask), which outperforms the top submitted model (F1 = 0.7406).
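    A minimal sketch of the single three-class setup described above, using the Hugging Face transformers API; the checkpoint path is a hypothetical placeholder for an AlBERTo model, and the fine-tuning loop is omitted.

    ```python
    # One sequence-classification head with three labels instead of two binary
    # models. MODEL_NAME is a hypothetical placeholder for an AlBERTo checkpoint.
    import torch
    from transformers import AutoTokenizer, AutoModelForSequenceClassification

    LABELS = ["non-misogynous", "non-aggressive misogynous", "aggressive misogynous"]
    MODEL_NAME = "path/to/alberto-checkpoint"  # placeholder

    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME, num_labels=len(LABELS))

    def predict(tweet: str) -> str:
        inputs = tokenizer(tweet, return_tensors="pt", truncation=True)
        with torch.no_grad():
            logits = model(**inputs).logits
        return LABELS[int(logits.argmax(dim=-1))]

    # The two subtask decisions follow from the single prediction:
    # misogynous  = predict(t) != "non-misogynous"
    # aggressive  = predict(t) == "aggressive misogynous"
    ```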

    Wikicardi : hacia la extracción de oraciones paralelas de Wikipedia

    Get PDF
    One of the goals of the Tacardi project (TIN2012-38523-C02-00) is to extract parallel sentences from comparable corpora in order to enrich and adapt machine translation systems. In this research we use a subset of Wikipedia as a comparable corpus. This report describes our progress on the extraction of parallel fragments from Wikipedia. First, we discuss how we defined the three domains of interest (science, computer science and sport) within the encyclopedia, and how we extracted the texts and the other data needed to characterise the articles in the different languages. We then briefly discuss the models we will use to identify parallel sentences and give only a sample of preliminary results. The data obtained so far suggest that it will be possible to extract parallel sentences from the domains of interest in the short term, although we do not yet have an estimate of their volume.
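    One possible way to pair articles across language editions, as a starting point for the extraction described above, is to follow Wikipedia's interlanguage links through the MediaWiki API; this is a generic sketch and not necessarily the procedure used in the report, and the article title in the comment is only illustrative.

    ```python
    # Fetch the English counterpart of a Spanish Wikipedia article via the
    # MediaWiki API (prop=langlinks).
    import requests

    API = "https://es.wikipedia.org/w/api.php"

    def english_counterpart(spanish_title: str):
        params = {
            "action": "query",
            "titles": spanish_title,
            "prop": "langlinks",
            "lllang": "en",
            "format": "json",
        }
        pages = requests.get(API, params=params).json()["query"]["pages"]
        for page in pages.values():
            for link in page.get("langlinks", []):
                return link["*"]  # English title, if an interlanguage link exists
        return None

    # english_counterpart("Aprendizaje automático")  # -> "Machine learning"
    ```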

    Findings of the NLP4IF-2019 Shared Task on Fine-Grained Propaganda Detection

    Full text link
    We present the shared task on Fine-Grained Propaganda Detection, which was organized as part of the NLP4IF workshop at EMNLP-IJCNLP 2019. There were two subtasks. FLC is a fragment-level task that asks for the identification of propagandist text fragments in a news article and for the prediction of the specific propaganda technique used in each such fragment (an 18-way classification task). SLC is a sentence-level binary classification task that asks systems to detect the sentences that contain propaganda. A total of 12 teams submitted systems for the FLC task, 25 teams did so for the SLC task, and 14 teams eventually submitted a system description paper. For both subtasks, most systems managed to beat the baseline by a sizable margin. The leaderboard and the data from the competition are available at http://propaganda.qcri.org/nlp4if-shared-task/.

    Answer Selection in Arabic Community Question Answering: A Feature-Rich Approach

    Get PDF
    The task of answer selection in community question answering consists of identifying pertinent answers from a pool of user-generated comments related to a question. The recent SemEval-2015 campaign introduced a shared task on community question answering, providing a corpus and an evaluation scheme. In this paper we address the problem of answer selection in Arabic. Our proposed model includes a rich set of features, including lexical and semantic similarities, vector representations, and rankings. We investigate the contribution of each set of features in a supervised setting. We show that combining the features by means of a linear support vector machine achieves better performance than the competition winner (F1 of 79.25 compared to 78.55).
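    A minimal sketch of the feature-combination setup, with hypothetical feature extractors standing in for the lexical/semantic similarities, vector representations, and rankings used in the paper.

    ```python
    # Concatenate several groups of question-answer features and train a linear
    # SVM on them. The feature functions below are simplified placeholders.
    import numpy as np
    from sklearn.svm import LinearSVC

    def lexical_features(question: str, answer: str) -> np.ndarray:
        q, a = set(question.lower().split()), set(answer.lower().split())
        overlap = len(q & a) / max(len(q | a), 1)   # Jaccard word overlap
        return np.array([overlap])

    def semantic_features(question: str, answer: str) -> np.ndarray:
        return np.zeros(1)  # would hold embedding-based similarities

    def rank_features(position_in_thread: int) -> np.ndarray:
        return np.array([1.0 / (1 + position_in_thread)])

    def featurize(question, answer, position):
        return np.concatenate([lexical_features(question, answer),
                               semantic_features(question, answer),
                               rank_features(position)])

    # X = np.vstack([featurize(q, a, pos) for q, a, pos in train_examples])
    # y = np.array(labels)              # 1 = pertinent answer, 0 = not
    # clf = LinearSVC(C=1.0).fit(X, y)  # linear SVM as in the abstract
    ```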

    Proppy: A System to Unmask Propaganda in Online News

    Full text link
    We present proppy, the first publicly available real-world, real-time propaganda detection system for online news, which aims at raising awareness, thus potentially limiting the impact of propaganda and helping to fight disinformation. The system constantly monitors a number of news sources, deduplicates and clusters the news into events, and organizes the articles about an event on the basis of the likelihood that they contain propagandistic content. The system is trained on known propaganda sources using a variety of stylistic features. The evaluation results on a standard dataset show state-of-the-art performance for propaganda detection.
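    A sketch of the last stage of the pipeline described above, scoring an event's articles by the likelihood of propagandistic content; character n-gram TF-IDF plus logistic regression is used here as a stand-in for proppy's actual stylistic features.

    ```python
    # Train a propaganda scorer on articles labelled by source and use it to
    # rank the articles of an event. The feature choice is an assumption.
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    scorer = make_pipeline(
        TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 5), min_df=5),
        LogisticRegression(max_iter=1000),
    )
    # scorer.fit(train_texts, train_labels)  # labels derived from known sources

    def rank_event_articles(articles):
        """Order an event's articles by predicted propaganda likelihood."""
        scores = scorer.predict_proba([a["text"] for a in articles])[:, 1]
        return sorted(zip(articles, scores), key=lambda pair: -pair[1])
    ```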